Recover from UDS socket loss by making BadSocketError retryable #330

Open
joshuay03 wants to merge 1 commit into DataDog:master from joshuay03:uds-reconnect-on-bad-socket

@joshuay03 joshuay03 commented Jun 2, 2026

We ran into this at @buildkite when the Datadog Agent restarted on one of our nodes: every Datadog::Statsd instance in our Ruby process stayed wedged on the dead socket fd and dropped custom metrics until the app restarted.

UDSConnection#send_message wraps Errno::ECONNREFUSED, Errno::ECONNRESET, and Errno::ENOENT as BadSocketError, but Connection#write's retry list did not include it, so the socket was never closed.

This PR introduces Connection::RetryableError, makes BadSocketError inherit from it, and adds it to the retry list. Both the existing TODO in UDSConnection#send_message and the "# FIXME: BadSocketError is not correctly caught by Connection class to retry" notes in the UDS specs already recommended this shape. The specs previously marked pending: true or named "does not correctly retry" now flip to positive assertions.
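For readers who do not have the diff open, the shape described above can be sketched roughly as follows. This is an illustrative stand-in, not the code in this PR: SketchConnection, its socket factory block, and the send_payload method on the socket object are hypothetical names, and the single-retry limit is an assumption made for the example.

```ruby
# Illustrative sketch only. SketchConnection, the socket factory block, and
# #send_payload are hypothetical stand-ins, not the dogstatsd-ruby classes.
class SketchConnection
  # Marker base class: anything inheriting from it is safe to retry
  # after the socket has been torn down.
  class RetryableError < StandardError; end
  class BadSocketError < RetryableError; end

  # The block builds a fresh socket each time it is called.
  def initialize(&socket_factory)
    @socket_factory = socket_factory
    @socket = nil
  end

  def write(payload)
    retries = 0
    begin
      send_message(payload)
    rescue RetryableError
      # Drop the dead socket so the next attempt reconnects lazily.
      close
      retries += 1
      retry if retries <= 1 # assumed limit: one reconnect attempt
      raise
    end
  end

  def close
    @socket&.close
    @socket = nil
  end

  private

  def socket
    @socket ||= @socket_factory.call
  end

  # Wrap the low-level errno failures in the retryable error type.
  def send_message(payload)
    socket.send_payload(payload)
  rescue Errno::ECONNREFUSED, Errno::ECONNRESET, Errno::ENOENT => e
    raise BadSocketError, "#{e.class}: #{e.message}"
  end
end
```

The point of the marker class in this sketch is that the retry logic only needs to know about RetryableError. Any transport-specific error that subclasses it gets the close-and-reconnect treatment without further changes to write.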

@joshuay03 joshuay03 force-pushed the uds-reconnect-on-bad-socket branch 3 times, most recently from 1479786 to 4f9e1af Compare June 2, 2026 01:19
UDSConnection raises BadSocketError when sendmsg_nonblock hits
ECONNREFUSED, ECONNRESET, or ENOENT, but Connection#write's retry
list did not include it, so the socket was never closed and every
subsequent send failed the same way for the lifetime of the process.

Introduce Connection::RetryableError and make BadSocketError inherit
from it. Connection#write now closes and reconnects before retrying.
@joshuay03 joshuay03 force-pushed the uds-reconnect-on-bad-socket branch from 4f9e1af to 8d2743d Compare June 2, 2026 01:23
@joshuay03 joshuay03 changed the title Recover from UDS socket loss by making BadSocketError retryable Recover from UDS socket loss by making BadSocketError retryable Jun 2, 2026
@joshuay03 joshuay03 marked this pull request as ready for review June 2, 2026 01:24
@joshuay03 joshuay03 requested a review from a team as a code owner June 2, 2026 01:24
@joshuay03 joshuay03 moved this to On Hold in Open Source Jun 2, 2026
@joshuay03 joshuay03 moved this from On Hold to In Progress / Pending Review in Open Source Jun 2, 2026